It has been six months since I wrote about the potential for the new transformer AI technology to serve as an expert medical system. Since then there has been a slew of studies testing the ability of ChatGPT and similar systems to make diagnoses, support clinical decisions, and pass standardized medical exams. The results have mostly been positive. For example, earlier this year Kung et al. published a study in which they found that ChatGPT was able to pass all three parts of the United States Medical Licensing Exam (USMLE), with a borderline passing grade of 60%. There have been numerous specialty board exam studies as well, with mixed results, but with ChatGPT passing most of them.

A recent study extends this research by looking not just at medical knowledge but at medical decision-making. For the study the researchers used 36 published clinical vignettes from the Merck Sharp & Dohme (MSD) Clinical Manual and tested the ability of ChatGPT to generate an initial differential diagnosis, recommend clinical management decisions (such as which studies to order), and then make a final diagnosis from this information. They found:

“ChatGPT achieved an overall accuracy of 71.7% (95% CI 69.3%-74.1%) across all 36 clinical vignettes. The LLM demonstrated the highest performance in making a final diagnosis with an accuracy of 76.9% (95% CI 67.8%-86.1%) and the lowest performance in generating an initial differential diagnosis with an accuracy of 60.3% (95% CI 54.2%-66.6%). Compared to answering questions about general medical knowledge, ChatGPT demonstrated inferior performance on differential diagnosis (β=–15.8%; P<.001) and clinical management (β=–7.4%; P=.02) question types.”

This is impressive, and fits with prior research about the strengths and weaknesses of ChatGPT-type systems. For review, ChatGPT is a publicly available example of what is called a large language model (LLM). The core artificial intelligence (AI) technology is called a transformer – the “GPT” stands for generative pre-trained transformer. It is generative because it is not simply copying text from some source; it is generating text based on a predictive model. It is pre-trained on a vast body of text scraped from the internet.

These LLM systems do not think, and are not on the way to general AI that simulates human intelligence. They have been compared to a really good auto-complete – they work by predicting the most likely next word segment (token) based upon billions of examples from the internet. And yet their results can be quite impressive. They can produce natural-sounding language, and can display an impressive breadth of knowledge.
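To make the auto-complete analogy concrete, here is a minimal sketch of next-token prediction using the small, openly available GPT-2 model via the Hugging Face transformers library. This is a stand-in only – ChatGPT’s own model is not publicly downloadable – and the clinical prompt is invented for illustration:

```python
# Minimal sketch of next-token prediction, the core mechanism of an LLM.
# GPT-2 is used as a stand-in; ChatGPT's own weights are not public.
# Requires: pip install torch transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The patient presented with fever, cough, and"  # invented example
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)

# Turn the final position's scores into a probability distribution over the
# vocabulary, then show the five most likely next tokens.
probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(probs, k=5)
for p, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode([int(token_id)])!r}: {p:.3f}")
```

That is all the model is doing, over and over: scoring every possible next token and picking from the top of the list.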

But they are still brittle in the way that such narrow AI systems are brittle, meaning that if you push them they will break. For these LLMs the primary weakness is that they are susceptible to so-called hallucinations – they can simply make things up. Remember, they are generating text based upon probability, not fact-checking or reflecting verified knowledge. If, for example, two things are statistically likely to be mentioned together, ChatGPT will generate text that makes it seem as if they are directly related. It can also entirely fabricate plausible-looking references, by generating a reference-like structure and filling it in with statistically determined but fake details.

This is a serious flaw for an expert system. To put ChatGPT’s performance on the recent study into context, it barely passed, with a level of knowledge comparable to that of an average new medical school graduate, not a seasoned clinician. It is therefore not yet at the level of being able to practice medicine. There are two questions: will it ever be, and can it be useful in the meantime?

Taking the second question first, I think at the moment a general LLM application like ChatGPT can be somewhat useful as an expert system, meaning that it is used by an expert as a tool to help them function. But its usefulness comes with some significant cautions and caveats. The results that ChatGPT produces cannot be trusted. They should not be taken as authoritative, even if they sound that way. But they can be used as an idea generator, to suggest possible diagnoses that a clinician may not have thought of.

What about for the non-expert user? Can an average person use ChatGPT as a search engine to find reasonable answers to medical questions? The answer is similar – it’s about as good as a typical Google search, although with a natural-language interface. But there is no guarantee that the information is accurate. ChatGPT essentially just reflects the information out there on the internet, both good and bad. The way questions are phrased will also tend to bias the answers. Again, remember, ChatGPT does not think or understand (as humans do); it’s just a predictive model.

What, then, is the potential for such systems in the future? I think the potential is great. ChatGPT is a general-purpose LLM, not specifically trained as a medical expert, and yet it does fairly well. Imagine a medical-expert version of ChatGPT, trained not on the internet but on the totality of published medical studies, practice standards, and expert analysis. It seems likely that such an LLM would outperform ChatGPT or similar general models.

Also, results can be improved by properly training the user. A recent study looked at the potential for “instruction prompt tuning”. This means creating prompts (the questions you ask an LLM) that are designed to produce more reliable results. These can be based on tested exemplars, as in the sketch below. We might see a future where optimizing medical LLM prompts is a class in medical school.
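As an illustration only – the published technique also involves tuning learned “soft” prompt vectors, which is not shown here – the simplest text-only version of the idea is to prepend vetted exemplars to a new question so the model imitates their format and rigor. Every instruction, exemplar, and clinical detail in this sketch is invented:

```python
# Hypothetical sketch of exemplar-based prompting: tested example cases are
# prepended to the real question so the model imitates their structure.
# All instructions, exemplars, and clinical content are invented.

INSTRUCTION = (
    "You are assisting a clinician. For each case, list a ranked "
    "differential diagnosis and the single best next diagnostic step."
)

# In practice these would be exemplars vetted by clinicians for accuracy.
EXEMPLARS = [
    {
        "case": "A 58-year-old man with crushing substernal chest pain...",
        "response": "Differential: 1. Acute coronary syndrome ... Next step: ECG.",
    },
]

def build_prompt(new_case: str) -> str:
    """Assemble instruction + worked exemplars + the new case into one prompt."""
    parts = [INSTRUCTION, ""]
    for ex in EXEMPLARS:
        parts += [f"Case: {ex['case']}", f"Response: {ex['response']}", ""]
    parts += [f"Case: {new_case}", "Response:"]
    return "\n".join(parts)

print(build_prompt("A 24-year-old woman with acute pleuritic chest pain..."))
```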

There does seem to be a general consensus that these LLM AI systems have tremendous potential as medical expert systems. They are currently right at the edge of functional basic medical knowledge, but not at the level of experienced clinicians. They also suffer from significant limitations, such as making up fake information. But it does seem that we are incredibly close to getting such systems to the point where they could significantly improve the practice of medicine. They could help reduce error and misdiagnosis, and also chart the most efficient pathway of diagnostic workup or clinical management. Medicine, ultimately, is a game of statistics, and an AI medical assistant could provide the statistical and factual information that a clinician needs at the point of patient care (one of the ultimate goals of evidence-based medicine).
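To give one purely illustrative example of the kind of point-of-care statistic such an assistant could surface: Bayes’ theorem, in its odds form, converts a clinician’s pre-test probability and a test’s likelihood ratio into a post-test probability. The numbers below are made up:

```python
# Illustrative sketch of a point-of-care calculation: updating a diagnostic
# probability with a test result via Bayes' theorem (odds form).
# The probabilities and likelihood ratio below are invented for illustration.

def post_test_probability(pre_test_prob: float, likelihood_ratio: float) -> float:
    """Convert pre-test probability to post-test probability for a test result."""
    pre_odds = pre_test_prob / (1.0 - pre_test_prob)   # probability -> odds
    post_odds = pre_odds * likelihood_ratio            # apply the test's LR
    return post_odds / (1.0 + post_odds)               # odds -> probability

# Example: 20% clinical suspicion, then a positive test with a likelihood ratio of 8
print(f"{post_test_probability(0.20, 8.0):.0%}")  # prints 67%
```

An assistant that could pull the relevant likelihood ratios from the literature on demand would be doing exactly this kind of statistical legwork for the clinician.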

A medical LLM could also help clinicians keep up to date. It is challenging, to say the least, to stay always at the absolute cutting edge of medical knowledge. The internet has made this a lot easier – a clinician can now quickly look up a medical question and see what the latest published studies say. But the faster, more efficient, and more thorough we can make this process, the better.

There still needs to be a human in the loop (and there will be until we have general AI with full human intelligence). This is because medicine is also a human practice, and requires judgement, emotional calculations about risk vs benefit, goals of care, and a human perspective. Facts alone are not enough. But it is always best to make those human and personal medical decisions from the perspective of accurate, up-to-date, and thorough medical information.

 


Posted by Steven Novella

Founder and currently Executive Editor of Science-Based Medicine Steven Novella, MD is an academic clinical neurologist at the Yale University School of Medicine. He is also the host and producer of the popular weekly science podcast, The Skeptics’ Guide to the Universe, and the author of the NeuroLogicaBlog, a daily blog that covers news and issues in neuroscience, but also general science, scientific skepticism, philosophy of science, critical thinking, and the intersection of science with the media and society. Dr. Novella also has produced two courses with The Great Courses, and published a book on critical thinking, also called The Skeptics’ Guide to the Universe.